Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SCHEMA] Add TSV column files #827

Merged
merged 45 commits into from
Nov 9, 2021
Merged

Conversation

tsalo
Copy link
Member

@tsalo tsalo commented Jul 6, 2021

Closes None.

Tagging @kayrobbins so she can take a look.

Changes proposed:

  • Add a YAML file describing columns for TSV files in the specification.
  • Add a macro and Python function to render column tables.
  • Replace hardcoded tables in the specification with macro-generated ones.
    • Currently, the generated tables are provided in addition to the hardcoded tables, for easy comparison during review.

To do:

@CPernet
Copy link
Collaborator

CPernet commented Jul 9, 2021

can I ask why using YAML? for many people BIDS is scary just because we have tsv and json ... this adds yet another layer -- I think we need to be careful

@tsalo
Copy link
Member Author

tsalo commented Jul 9, 2021

The main reason we chose yaml over json is that yaml allows comments. Also, the schema files describing the specification aren't things that BIDS users will interact with, any more than they spend time with the pybids or bids-validator code.

@CPernet
Copy link
Collaborator

CPernet commented Jul 9, 2021

ah I see - ok cool thx, just wanted to know more

@Remi-Gau
Copy link
Collaborator

Remi-Gau commented Jul 9, 2021 via email

tsalo added 2 commits July 9, 2021 11:48
Note that there are three very different uses of "name" columns, and two of them are equally common, so I chose not to specify any of them as the "canonical" definition.
@tsalo tsalo added the schema Issues related to the YAML schema representation of the specification. Patch version release. label Jul 22, 2021
@erdalkaraca
Copy link
Collaborator

Made a mental note yesterday though that I will need to double check how this splitting of content in several files and format might affect "on-boarding" of people who want to contribute to specifications (especially for BEP content but also in general). Will try to remember to raise the issue at the next maintainers meeting I attend.

Sent from my Galaxy

Normally (speaking from experience), a technical schema is somehow self-contained. Indeed, the many yaml files may scare off contributors, but, as Taylor mentioned, they are rather intended for automatic processing via tooling

@Tokazama
Copy link
Member

If the intent is to provide a machine readable method, multiple files and directories on github is more difficult. It's pretty simple to download raw files, but the easiest way to manage the many schema files is to clone the whole repo. It's not an unmanageable obstacle, just a bit obnoxious.

@tsalo
Copy link
Member Author

tsalo commented Sep 22, 2021

I will consolidate the column files into a single file if #883 is merged. I'm planning to hold off on changing the structure in this PR until that PR is handled.

@effigies
Copy link
Collaborator

@tsalo Apologies for taking so long. Which PR should be considered higher priority?

@tsalo
Copy link
Member Author

tsalo commented Sep 22, 2021

This PR contains content that needs to be reviewed, while the other really just needs a quick once-over from anyone who uses the schema. The changes in #883 include updates to the internal schema-processing code, so nothing in the specification should really change there.

In summary, I think that #883 is higher priority, but this PR will need a more thorough review.

@tsalo
Copy link
Member Author

tsalo commented Sep 24, 2021

@effigies just a small note to follow up from our call earlier this week. It looks like the JSON fields used to describe TSV columns (e.g., "LongName", "Description", "Levels") are included in the metadata YAML files, so they won't need to be added here.

Copy link
Collaborator

@Remi-Gau Remi-Gau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if you are planning to add more to this but so far this looks good to me.

Copy link
Collaborator

@effigies effigies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I keep starting and having to drop this. Here are some comments in my queue before I start a new review.

Comment on lines +310 to +318
- cell line
- in vitro differentiated cells
- primary cell
- cell-free sample
- cloning host
- tissue
- whole organisms
- organoid
- technical sample
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we automatically turn this into the following?

One of "cell line", "in vitro differentiated cells", [...]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think automatically incorporating value restrictions like enums would be awesome. I'd prefer to tackle it in a separate PR though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened #912 about this.

src/schema/objects/columns.yaml Outdated Show resolved Hide resolved
Hexadecimal. Label color for visualization.
type: string
unit: hexadecimal
derived_from:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the column header "derived_from" seems fine, perhaps we should have a more descriptive term name:

Suggested change
derived_from:
sample_derived_from:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My current approach to multiply-defined objects is to start with the base name and only create a new definition if a new use for that object is proposed with a definition that is incompatible with the original. I'd prefer not to link objects to specific datatypes or sections until it becomes necessary, but others may feel differently.

When we do have multiple definitions of the same object, I want to make sure that the names appear together and that it's clear where the difference from the "name" comes from, so I would lean toward something like derived_from__sample. I added a section to the schema README about duplicate terms with different definitions, so hopefully that will be more clear now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the last remaining blocker, and what we decide here should apply across the schema. Here is what I added to the schema README. @effigies @Remi-Gau @sappelhoff WDYT?

If an object may mean something different depending on where it is used within the specification,
then this must be reflected in the schema.
Specifically, each version of the object must have its own definition within the relevant file.
However, since object files are organized as dictionaries, each object must have a unique key.
Thus, we append a suffix to each re-used object's key in order to make it unique.
For objects with CamelCase names (for example, metadata fields), the suffix will start with a single underscore (_).
For objects with snake_case names, two underscores must be used.

There should also be a comment near the object definition in the YAML file describing the nature of the different objects.

For example, the TSV column "reference" means different things when used for EEG data, as compared to iEEG data.
As such, there are two definitions in columns.yaml for the "reference" column: "reference__eeg" and "reference_ieeg".

# reference column for channels.tsv files for EEG data
reference__eeg:
  name: reference
  description: |
    Name of the reference electrode(s).
    This column is not needed when it is common to all channels.
    In that case the reference electrode(s) can be specified in `*_eeg.json` as `EEGReference`).
  type: string
# reference column for channels.tsv files for iEEG data
reference__ieeg:
  name: reference
  description: |
    Specification of the reference (for example, 'mastoid', 'ElectrodeName01', 'intracranial', 'CAR', 'other', 'n/a').
    If the channel is not an electrode channel (for example, a microphone channel) use `n/a`.
  anyOf:
  - type: string
  - type: string
    enum:
    - n/a

When adding new object definitions to the schema,
every effort should be made to find a shared, common definition for the term, should it already exist.
If the differences between two versions of the same object are subtle or driven by context,
then you can generally append additional text to the object definition within the associated rendered table in the specification,
rather than creating a separate entry in the schema.

Copy link
Collaborator

@effigies effigies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a proposal for including common text in both samples and sessions.

src/03-modality-agnostic-files.md Outdated Show resolved Hide resolved
src/03-modality-agnostic-files.md Show resolved Hide resolved
src/schema/objects/columns.yaml Outdated Show resolved Hide resolved
of the strain of the species, for example: `RRID:IMSR_JAX:000664`.
type: string
format: rrid
time:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a good idea to give this a specific name like blood_sample_time.

@effigies
Copy link
Collaborator

Since I think @sappelhoff covered *EG, I read all rendered pages except that.

I have not gone through the actual schema file in detail or the Python (at least recently). Please let me know if you'd like me to.

Comment on lines +186 to +196
{{ MACROS___make_columns_table(
{
"participant_id": ("REQUIRED", "There MUST be exactly one row for each participant."),
"species": "RECOMMENDED",
"age": "RECOMMENDED",
"sex": "RECOMMENDED",
"handedness": "RECOMMENDED",
"strain": "RECOMMENDED",
"strain_rrid": "RECOMMENDED",
}
) }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it accurate to say that this macro represents a BIDS rule for the `participants.tsv" file? If so, should this be described in the schema somewhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the tables do represent rules, but those rules are too complicated to be translated directly from the tables. For more information on the types of mechanisms we need to support in the schema to accurately codify the specification's rules, please see #620.

Copy link
Member

@sappelhoff sappelhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am happy with this, thanks Taylor.

@sappelhoff sappelhoff added this to the 1.7.0 milestone Nov 9, 2021
@tsalo tsalo merged commit 5400d6f into bids-standard:master Nov 9, 2021
@tsalo tsalo deleted the tsv-columns-schema branch November 9, 2021 19:02
@tsalo tsalo added the schema-code Updates or changes to the code used to parse, filter, and render the schema. label Apr 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
schema Issues related to the YAML schema representation of the specification. Patch version release. schema-code Updates or changes to the code used to parse, filter, and render the schema.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants